Abstract: In this paper, we further develop a family of
parallel time integrators known as Revisionist Integral De- 
ferred Correction methods (RIDC) to allow for the semi- 
implicit solution of time dependent PDEs. Additionally, we 
show that our semi-implicit RIDC algorithm can harness the 
computational potential of multiple general purpose graphical 
processing units (GPUs) in a single node by utilizing existing 
CUBLAS libraries for matrix linear algebra routines in our 
implementation. In the numerical experiments, we show that 
our implementation computes a fourth order solution using 
four GPUs and four CPUs in approximately the same wall 
clock time as a first order solution computed using a single 
GPU and a single CPU. 

Keywords: advection-diffusion, reaction-diffusion, integral
deferred correction, parallel integrators, graphics processing
units

I. Introduction 

RIDC methods are parallel-in-step time integrators [4], 
[6]. The "revisionist" terminology was first adopted in [4] to 
highlight that (i) this is a revision of the standard integral (or 
spectral) defect correction (IDC or SDC) methods [8], [7], 
[5], [9], [18], [22], and (ii) successive corrections, running 
in parallel but lagging in time, revise and improve the 
approximation to the solution. This notion of time paral- 
lelization is particularly exciting because it can be potentially 
layered upon existing spatial parallelization techniques [3], 
[26], including algorithms that utilize GPU cards to solve 
time dependent PDEs [16], [25], [1], to add further parallel 
scalability. 

The main idea behind RIDC methods is to rewrite the
defect correction framework [27], [28] so that, after initial 
start-up costs, each correction loop can be lagged behind 
the previous correction loop in a manner that facilitates 
running the predictor and correctors in parallel. This idea 
for parallel time integrators was previously published by 
the present authors in [4], [6]. As before, this is still small 
scale parallelism in the sense that the time parallelization is 
limited by the order one wants to achieve. 

To harness the computational potential of multiple graphical
processing units (GPUs) on a single node, the CUBLAS
library [23] (a collection of linear algebra subroutines
coded in CUDA) is used to demonstrate that, by
threading the RIDC loops, multiple GPUs can be utilized
for our semi-implicit RIDC algorithm. We present numerical
experiments in Section IV to show that our algorithm and
implementation compute a fourth order semi-implicit solution
using four GPUs and four CPUs in approximately the
same wall clock time as a first order forward-backward Euler
solution computed using a single GPU and a single CPU. We
stress that our parallel speedup comes from a unique way to 
utilize existing parallel libraries, in this case the CUBLAS 
libraries provided by NVIDIA. Unless data decomposition
is used (whether in the host code or within a new parallel
library), one could not use multiple GPU cards in as simple
a fashion with a sequential additive Runge-Kutta (ARK)
integrator. We believe that this work will provide the
scientific community with a straightforward way to add
further parallelism to existing software that generates low
order (in time) solutions to time dependent PDEs using only
a single GPU card.

Readers might be familiar with parareal integrators [13], 
[14], [19], [21], [15], another family of parallel time inte- 
grators. In such methods, the time domain is split into sub- 
problems that can be computed in parallel, and an iterative 
procedure for coupling the sub-problems is applied, so that 
the overall method converges to the solution of the full 
problem. Parareal integrators are philosophically different
from RIDC methods. While parareal methods allow for
large scale parallelization, there are nontrivial choices of
the fine and coarse predictors that affect convergence to
the desired solution. RIDC methods, on the other hand,
guarantee convergence and high order solutions. A class
of parareal methods by Mike Minion and Matthew Emmett
(CAMCOS, 2012) is potentially a generalization of the
RIDC methods, but further analysis would be needed to
validate that statement.

This paper is organized as follows: In Section II, we re- 
view Implicit-Explicit (IMEX) methods, which are a family 
of high order semi-implicit integrators [2], [17]. In Sec- 
tion III, semi-implicit RIDC methods and their properties are 
presented. Then, numerical benchmarks comparing RIDC 
and additive Runge-Kutta methods are given in Section IV, 
followed by concluding remarks in Section V. 



II. IMEX Methods 

We are interested in solutions to initial value problems of
the form

\[
\begin{cases}
y'(t) = f^S(t,y) + f^N(t,y), & t \in [a,b], \\
y(a) = \alpha,
\end{cases} \tag{1}
\]

where $y, \alpha \in \mathbb{R}^n$,
$f^N : \mathbb{R} \times \mathbb{R}^n \to \mathbb{R}^n$ and
$f^S : \mathbb{R} \times \mathbb{R}^n \to \mathbb{R}^n$.
The function $f^S(t,y)$ contains stiff terms that need to be
handled implicitly, and $f^N(t,y)$ consists of non-stiff terms
that can be handled explicitly. A first order implicit-explicit
(IMEX) discretization of the IVP (1) can be written as

\[
\frac{y_{n+1} - y_n}{\Delta t} = f^S(t_{n+1}, y_{n+1}) + f^N(t_n, y_n),
\quad \text{with } y_0 = \alpha.
\]

The above IMEX discretization is particularly useful if
$f^S(t,y)$ is linear in $y$, i.e. $f^S(t,y) = Dy$, which is often
the case in a method of lines discretization of PDEs containing
relaxation terms. In such cases, the IMEX discretization
reduces to a linear system solve for the solution at each time
level,

\[
(I - D\,\Delta t)\, y_{n+1} = y_n + \Delta t\, f^N(t_n, y_n),
\quad \text{with } y_0 = \alpha. \tag{2}
\]
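To illustrate why the stiff term is treated implicitly, the following minimal sketch applies the scalar analogue of (2) and, for comparison, a fully explicit Euler step to the same problem. The test problem $y' = \lambda y + \cos t$ with $\lambda = -1000$, the step size, and all function names are assumptions chosen for illustration; this is not the implementation used in the experiments.

```python
import math

LAM = -1000.0   # assumed stiff coefficient, so that f^S(t, y) = LAM * y


def f_nonstiff(t, y):
    """Assumed non-stiff term f^N(t, y)."""
    return math.cos(t)


def imex_euler(y0, dt, nsteps):
    """Scalar form of (2): (1 - LAM*dt) * y_{n+1} = y_n + dt * f^N(t_n, y_n)."""
    y = y0
    for n in range(nsteps):
        y = (y + dt * f_nonstiff(n * dt, y)) / (1.0 - LAM * dt)
    return y


def explicit_euler(y0, dt, nsteps):
    """Forward Euler applied to both terms, for comparison only."""
    y = y0
    for n in range(nsteps):
        y = y + dt * (LAM * y + f_nonstiff(n * dt, y))
    return y


# dt * |LAM| = 10 lies far outside the forward Euler stability region.
dt, nsteps = 0.01, 100
y_imex = imex_euler(1.0, dt, nsteps)
y_expl = explicit_euler(1.0, dt, nsteps)
print(f"IMEX Euler:     {y_imex:.6e}")
print(f"explicit Euler: {y_expl:.6e}")
```

With this step size the IMEX iterate stays bounded while the explicit iterate grows without bound, which is the behavior that motivates the splitting.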

An s-stage diagonally implicit RK (DIRK) method and an
s-stage explicit RK method are coupled to generate a high
order semi-implicit integrator. The discretization of IVP (1)
using an IMEX method can be written as

\[
y_{n+1} = y_n + \Delta t \sum_{i=1}^{s} \left( b_i^S K_{ni}^S + b_i^N K_{ni}^N \right),
\quad \text{with } y_0 = \alpha,
\]

where the stages satisfy

\[
K_{ni}^S = f^S\Big( t_{ni},\; y_n + \Delta t \sum_{j=1}^{i} a_{ij}^S K_{nj}^S
          + \Delta t \sum_{j=1}^{i-1} a_{ij}^N K_{nj}^N \Big),
\]
\[
K_{ni}^N = f^N\Big( t_{ni},\; y_n + \Delta t \sum_{j=1}^{i} a_{ij}^S K_{nj}^S
          + \Delta t \sum_{j=1}^{i-1} a_{ij}^N K_{nj}^N \Big),
\]

with $t_{ni} = t_n + c_i \Delta t$. The third and fourth order IMEX
methods from [20] are used to benchmark against our fourth
order RIDC-FBE (RIDC constructed using forward and
backward Euler integrators). The third order IMEX method
is constructed from the following Butcher tableaux:
The fourth order IMEX method is constructed from the
following DIRK method
and explicit RK method
(Note that a second order IMEX scheme was also tested, but is
not presented because of its poor stability constraints.) As
before, if $f^S(t,y)$ is linear in $y$, then each stage computation
reduces to a linear solve, since a DIRK method was paired
with an explicit integrator in the discussed IMEX methods.

III. RIDC Methods 

RIDC methods are a class of time integrators based 
on integral deferred correction [10]. RIDC methods first 
compute a prediction to the solution ("level 0") using low 
order schemes (e.g. a first order implicit-explicit method) 
followed by one or more corrections to compute subsequent 
solution levels. Each correction revises the solution and 
increases the formal order of accuracy by 1, if a first order 
implicit-explicit integrator is used to solve the error equation. 
Each correction level is delayed from the previous level as
illustrated in Figure 1, where the open circles denote solution
values that are simultaneously computed. This staggering in
time means that the predictor and each corrector can all be
executed simultaneously, in parallel, while each processes a
different time step.
Figure 1: (RIDC4-FBE) This plot shows the staggering
required for a fourth order RIDC scheme, constructed using
first order implicit-explicit predictors and correctors. The
time axis runs horizontally, and the correction levels run
vertically. The white circles denote solution values that are
simultaneously computed, e.g., core 0 is computing the
predicted solution at time $t_{n+2}$ while core 1 is computing
the 1st corrected solution at time $t_{n+1}$, etc.

In Section III-A, we first derive the error equation. Then, 
Section III-B and Section III-C give numerical schemes for 
solving the IVP and the error equation. In Section III-D, we 
review theorems related to the formal order of accuracy that 
follow trivially from [5], and in Section III-E, we summarize 
starting and stopping details for the RIDC algorithm as well 
as the notion of restarts. 

A. Error Equation 

Suppose an approximate solution $\eta(t)$ to IVP (1) is
computed. Denote the exact solution as $y(t)$. Then, the error
of the approximate solution is

\[
e(t) = y(t) - \eta(t). \tag{3}
\]

If we define the residual as
$\epsilon(t) = \eta'(t) - f^S(t,\eta(t)) - f^N(t,\eta(t))$,
then the derivative of the error (3) satisfies

\[
e'(t) = y'(t) - \eta'(t)
      = f^S(t,y(t)) + f^N(t,y(t)) - f^S(t,\eta(t)) - f^N(t,\eta(t)) - \epsilon(t).
\]

The integral form of the error equation,

\[
\Big[ e(t) + \int_a^t \epsilon(\tau)\, d\tau \Big]'
  = f^S(t,\eta(t)+e(t)) - f^S(t,\eta(t))
  + f^N(t,\eta(t)+e(t)) - f^N(t,\eta(t)), \tag{4}
\]

can then be solved using the initial condition $e(a) = 0$.



B. The predictor 

To generate a provisional solution that can be corrected,
a low order integrator is applied to solve IVP (1); this
process is typically known as the prediction loop. The first
order IMEX scheme reviewed in Section II will be used to
generate our RIDC-FBE (RIDC forward and backward Euler)
method, though in theory any of the IMEX methods reviewed
in Section II can be used. We adopt the following notation:

\[
\eta^{[0]}_{n+1} = \eta^{[0]}_n + \Delta t\, f^S(t_{n+1}, \eta^{[0]}_{n+1})
                 + \Delta t\, f^N(t_n, \eta^{[0]}_n), \tag{5}
\]

where the superscript $[0]$ indicates that this is the solution at
level 0, the prediction level. When $f^S$ is nonlinear in $y$, this
non-linear equation can be solved using Newton's method.
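As a minimal sketch of one prediction step when the stiff term is nonlinear, the implicit equation can be written as $g(z) = z - \Delta t\, f^S(t_{n+1}, z) - \text{rhs} = 0$ and solved by Newton's method. The cubic stiff term, the constant non-stiff term, the tolerance, and the function names below are all hypothetical choices for illustration.

```python
def f_stiff(t, y):
    """Assumed nonlinear stiff term f^S(t, y)."""
    return -50.0 * y ** 3


def df_stiff_dy(t, y):
    """Derivative of the assumed stiff term with respect to y."""
    return -150.0 * y ** 2


def f_nonstiff(t, y):
    """Assumed non-stiff term f^N(t, y)."""
    return 1.0


def predictor_step(eta_n, t_n, dt, tol=1e-12, max_iter=50):
    """One forward-backward Euler step: solve z - dt*f^S(t_{n+1}, z) = rhs for z."""
    rhs = eta_n + dt * f_nonstiff(t_n, eta_n)   # explicit part of the step
    t_np1 = t_n + dt
    z = eta_n                                   # initial Newton guess
    for _ in range(max_iter):
        g = z - dt * f_stiff(t_np1, z) - rhs
        if abs(g) < tol:
            break
        z -= g / (1.0 - dt * df_stiff_dy(t_np1, z))
    return z


eta_next = predictor_step(eta_n=1.0, t_n=0.0, dt=0.1)
# Residual of the implicit equation at the computed value.
residual = eta_next - 0.1 * f_stiff(0.1, eta_next) - (1.0 + 0.1 * f_nonstiff(0.0, 1.0))
print(f"eta_next = {eta_next:.12f}, residual = {residual:.2e}")
```

For a system of equations, the scalar division would be replaced by a linear solve with the Jacobian $I - \Delta t\, \partial f^S / \partial y$.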

C. The corrector 

The correctors are also low order integrators, but are
used to solve the error equation (4) for the error $e(t)$ to
an approximate solution $\eta(t)$. Since the error equation is
solved iteratively to improve a solution from the previous
level, each correction level computes an error $e^{[j-1]}(t)$ to
the solution at the previous level $\eta^{[j-1]}(t)$ to obtain a
revised solution $\eta^{[j]}(t) = \eta^{[j-1]}(t) + e^{[j-1]}(t)$.



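A first order IMEX discretization of the error equation (4)
(after some algebra) gives

\[
\eta^{[j]}_{n+1} = \eta^{[j]}_n
  + \Delta t \left[ f^S(t_{n+1}, \eta^{[j]}_{n+1}) + f^N(t_n, \eta^{[j]}_n) \right]
  - \Delta t \left[ f^S(t_{n+1}, \eta^{[j-1]}_{n+1}) + f^N(t_n, \eta^{[j-1]}_n) \right]
  + \int_{t_n}^{t_{n+1}} f(\tau, \eta^{[j-1]}(\tau))\, d\tau. \tag{6}
\]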
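The predictor followed by a single correction sweep can be sketched as follows. This is a sequential (not pipelined) illustration on an assumed scalar test problem $y' = \lambda y + \mu y$ with $\lambda = -2$, $\mu = 1$; the integral is approximated with the two-node trapezoid rule on $t_n$ and $t_{n+1}$, which is an assumption appropriate for one correction level. All names are hypothetical.

```python
import math

LAM, MU = -2.0, 1.0   # assumed test problem: y' = LAM*y + MU*y, y(0) = 1


def f_stiff(t, y):
    """f^S, handled implicitly."""
    return LAM * y


def f_nonstiff(t, y):
    """f^N, handled explicitly."""
    return MU * y


def predictor(y0, dt, nsteps):
    """Level 0: first order IMEX (forward-backward Euler) sweep."""
    eta = [y0]
    for n in range(nsteps):
        rhs = eta[n] + dt * f_nonstiff(n * dt, eta[n])
        eta.append(rhs / (1.0 - LAM * dt))      # scalar linear solve
    return eta


def corrector(prev, dt):
    """Level 1: IMEX Euler discretization of the error equation."""
    nsteps = len(prev) - 1
    f_prev = [f_stiff(n * dt, prev[n]) + f_nonstiff(n * dt, prev[n])
              for n in range(nsteps + 1)]
    eta = [prev[0]]
    for n in range(nsteps):
        t_n, t_np1 = n * dt, (n + 1) * dt
        quad = 0.5 * dt * (f_prev[n] + f_prev[n + 1])   # trapezoid quadrature
        rhs = (eta[n] + dt * f_nonstiff(t_n, eta[n])
               - dt * (f_stiff(t_np1, prev[n + 1]) + f_nonstiff(t_n, prev[n]))
               + quad)
        eta.append(rhs / (1.0 - LAM * dt))      # implicit term moved to the left
    return eta


dt, nsteps = 0.025, 40
level0 = predictor(1.0, dt, nsteps)
level1 = corrector(level0, dt)
exact = math.exp((LAM + MU) * dt * nsteps)
err0 = abs(level0[-1] - exact)
err1 = abs(level1[-1] - exact)
print(f"predictor error: {err0:.3e}, corrected error: {err1:.3e}")
```

In a RIDC run the two sweeps would execute concurrently, with the corrector lagging the predictor in time; the arithmetic per step is the same as in this sequential version.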



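The integral $\int_{t_n}^{t_{n+1}} f(\tau, \eta^{[j-1]}(\tau))\, d\tau$,
where $f = f^S + f^N$, is approximated using
quadrature. For the $j$th correction loop, $(j+1)$ nodes are
needed in the stencil to accurately approximate the integral.
There are various choices for the stencil, but in practice, the
stencil should include the nodes $t_n$ and $t_{n+1}$. We make the
following choice for selecting our quadrature nodes:

\[
\int_{t_n}^{t_{n+1}} f(\tau, \eta^{[j-1]}(\tau))\, d\tau \approx
\begin{cases}
\sum_{k=0}^{j} \alpha_{nk} \left( f^N(t_{n+1-k}, \eta^{[j-1]}_{n+1-k})
   + f^S(t_{n+1-k}, \eta^{[j-1]}_{n+1-k}) \right), & \text{if } n \ge j-1, \\
\sum_{k=0}^{j} \alpha_{nk} \left( f^N(t_k, \eta^{[j-1]}_k)
   + f^S(t_k, \eta^{[j-1]}_k) \right), & \text{if } n < j-1,
\end{cases} \tag{7}
\]

where the quadrature weights are given by

\[
\alpha_{nk} = \int_{t_n}^{t_{n+1}} \prod_{i=0,\, i \ne k}^{j}
   \frac{t - t_{n+1-i}}{t_{n+1-k} - t_{n+1-i}}\, dt,
\]

for $n \ge j-1$, $k = 0, 1, \ldots, j$, and

\[
\alpha_{nk} = \int_{t_n}^{t_{n+1}} \prod_{i=0,\, i \ne k}^{j}
   \frac{t - t_i}{t_k - t_i}\, dt, \qquad k = 0, 1, \ldots, j,
\]

for $n < j-1$. Since uniform time steps are used in the
computation, only one set of quadrature weights needs
to be computed, stored, then used as necessary.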
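The quadrature weights for a uniform grid can be sketched by integrating the Lagrange basis polynomials exactly in rational arithmetic. The sketch assumes a stencil of $(j+1)$ nodes $t_{n+1-k}$, $k = 0, \ldots, j$, as stated above, and rescales so that $t_n = 0$ and $\Delta t = 1$; the function names are hypothetical.

```python
from fractions import Fraction


def lagrange_weights(nodes, lo, hi):
    """Integrate each Lagrange basis polynomial on integer `nodes` over [lo, hi]."""
    weights = []
    for k, xk in enumerate(nodes):
        coeffs = [Fraction(1)]                 # polynomial coefficients, lowest degree first
        for i, xi in enumerate(nodes):
            if i == k:
                continue
            denom = Fraction(xk - xi)
            new = [Fraction(0)] * (len(coeffs) + 1)
            for d, c in enumerate(coeffs):     # multiply by (x - xi) / denom
                new[d + 1] += c / denom
                new[d] -= c * xi / denom
            coeffs = new
        integral = sum(c * (Fraction(hi) ** (d + 1) - Fraction(lo) ** (d + 1)) / (d + 1)
                       for d, c in enumerate(coeffs))
        weights.append(integral)
    return weights


def ridc_weights(j):
    """Steady-state weights for correction level j, in units of dt.

    The nodes t_{n+1-k}, k = 0..j, sit at positions 1 - k when t_n = 0 and dt = 1.
    """
    return lagrange_weights([1 - k for k in range(j + 1)], 0, 1)


for j in (1, 2, 3):
    print(j, [str(w) for w in ridc_weights(j)])
```

For $j = 1$ this recovers the trapezoid rule, and for larger $j$ the weights coincide with the classical Adams-Moulton coefficients, since both arise from integrating the same interpolant over $[t_n, t_{n+1}]$.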

D. Formal order of accuracy 

The analysis in [5], proving convergence under mild 
conditions for IDC-IMEX methods, extends simply to these 
RIDC-IMEX methods. 

Theorem III.1. Let $f(t,y)$ and $y(t)$ in IVP (1) be sufficiently
smooth. Then, the local truncation error for a RIDC
method constructed using first order IMEX integrators for
the predictor and $(p-1)$ correction loops is $O(\Delta t^{p+1})$.

Theorem III.2. Let $f(t,y)$ and $y(t)$ in IVP (1) be sufficiently
smooth. Then, the local truncation error for a RIDC
method constructed using uniform time steps, a $p_0$th order
ARK method in the prediction loop, and $(p_1, p_2, \ldots, p_j)$th
order ARK methods in the correction loops, is $O(\Delta t^{p+1})$,
where $p = \sum_{i=0}^{j} p_i$.
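As a worked instance of the second theorem above, consider the RIDC4-FBE scheme used in this paper, which has a first order predictor and three first order correctors:

```latex
% RIDC4-FBE: first order IMEX predictor and three first order IMEX correctors
p_0 = p_1 = p_2 = p_3 = 1
\quad\Longrightarrow\quad
p = \sum_{i=0}^{3} p_i = 4,
\qquad \text{local truncation error } O(\Delta t^{5}).
```

This agrees with the first theorem for $p = 4$, i.e. a first order predictor followed by $p - 1 = 3$ correction loops.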

E. Further Comments 

During most of a RIDC calculation, multiple solution
levels are marched in a pipeline using multiple
CPUs/GPUs. However, the computing nodes in the RIDC
algorithm cannot start simultaneously: each must wait for
the previous level to compute sufficient $\eta$ values before
they can be marched in a pipeline fashion. By carefully
controlling the start-up of a RIDC method, one can minimize
the amount of memory that is required to march the nodes in
a pipeline fashion. In our implementation, the order in which
computations are performed during start-up is illustrated in
Figure 2 for a fourth order RIDC constructed with first order
IMEX predictors and correctors. The $j$th processor (running
the $j$th correction) must initially wait for $j(j+1)/2$ steps,
e.g., node 2 has to wait 3 steps before starting. There are also
idle computing threads at the end of the computation, since
the predictor and lower level correctors will reach $t = b$
earlier than the last corrector.
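The start-up delay can be tabulated with a few lines of code. The wait of $j(j+1)/2$ steps is taken from the text; the convention that a level first becomes active on the computing step after its wait ends is an assumption consistent with the description of Figure 2, and the function name is hypothetical.

```python
def startup_wait(j):
    """Computing steps that level j waits before starting: j(j+1)/2."""
    return j * (j + 1) // 2


levels = 4   # predictor plus three correctors, as in a fourth order scheme
waits = [startup_wait(j) for j in range(levels)]
for j, w in enumerate(waits):
    # Assumed numbering: a level that waits w steps first computes at step w + 1.
    print(f"level {j}: waits {w} steps, first active at computing step {w + 1}")
```

The total start-up delay of the last corrector grows quadratically with the order, which is one reason each restart reduces the attainable speedup.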

An important notion to consider is "restarts", that is, 
instead of computing all the way to the final time, we 
compute on some smaller time interval to time t*, and use 
the most accurate solution to restart the RIDC computation 
at t = t*. In practice, restarting improves the stability of 
the semi-implicit RIDC scheme, and could lower the error
constant of the overall method, but at the cost of decreasing 
the speedup due to the additional cost of starting the RIDC 
algorithm multiple times. 

Additionally, one cannot increase the order of RIDC in- 
definitely as (i) it is not practical (when would one ever want 
a 16th order method?) and (ii) the Runge phenomenon [24], 
Figure 2: This figure is a graphical representation of how
the RIDC4-BE algorithm is started. The time axis runs
horizontally, the correction levels run vertically. All nodes
are initially populated with the initial data at $t_0$. This is
represented by computing step 0 (enclosed in a circle). At
computing step 1, node 0 computes the predicted solution at
$t_1$. The remaining nodes remain idle. At computing step 2,
node 0 computes the predicted solution at time $t_2$, node 1
computes the 1st corrected solution at time $t_1$, the remaining
two nodes remain idle. Note that in this starting algorithm,
special care is taken to ensure that minimum memory is
used by not letting the computing cores run ahead until they
can be marched in a pipeline; in this example, when node
3 starts computing $t_3$.



which arises from using equi-spaced interpolation points, 
will eventually cause the scheme to become unstable. In 
practice, 8th and 12th order RIDC methods using double 
precision do not suffer from the Runge phenomenon. 

Lastly, there is another family of parallel time integrators, 
known as parareal methods [21], that is actively being 
researched [11], [12]. These methods fall into the class of 
"parallel across the method" algorithms, where the entire 
time domain is split across multiple nodes, a coarse operator 
is run in serial, followed by a parallel correction update. We 
encourage users to more carefully consider parareal if the 
small scale parallelism offered by RIDC is not sufficient. 

IV. Numerical Examples 

Advection-reaction-diffusion equations have been widely 
used to model chemical processes across many disciplines. 
Here, we present two numerical examples: an advection- 
diffusion and a reaction-diffusion equation, to validate the 
order of accuracy of the RIDC4-FBE scheme, and the 
speedup obtained in the parallel OpenMP framework, and 
the OpenMP-CUDA hybrid framework. In each example, 
the stiff term is chosen as the diffusion operator,
$f^S(t,y) = y_{xx}$. Applying a centered finite difference operator
to approximate $\partial_{xx}$ reduces each RIDC/ARK step to a
series of decoupled linear system solves. The matrices are
pre-factored into their QR components so that each linear solve
is reduced to a matrix-vector multiplication and a back solve
operation.
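The factor-once, solve-many pattern described above can be sketched in plain Python. The small periodic matrix $I - \nu D$ with $\nu = 0.5$, the modified Gram-Schmidt factorization, and the function names are assumptions for illustration; the experiments in this paper use library routines instead.

```python
def qr_factor(a):
    """Modified Gram-Schmidt QR of a square matrix given as a list of rows.

    Returns the columns of Q and the upper triangular factor R.
    """
    n = len(a)
    q_cols = []
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = [a[i][j] for i in range(n)]
        for k, qk in enumerate(q_cols):
            r[k][j] = sum(qk[i] * v[i] for i in range(n))
            v = [v[i] - r[k][j] * qk[i] for i in range(n)]
        r[j][j] = sum(x * x for x in v) ** 0.5
        q_cols.append([x / r[j][j] for x in v])
    return q_cols, r


def qr_solve(q_cols, r, b):
    """Solve Q R x = b: one product with Q^T, then back substitution with R."""
    n = len(b)
    y = [sum(qk[i] * b[i] for i in range(n)) for qk in q_cols]   # y = Q^T b
    x = [0.0] * n
    for i in reversed(range(n)):
        s = y[i] - sum(r[i][k] * x[k] for k in range(i + 1, n))
        x[i] = s / r[i][i]
    return x


# Assumed small example: I - nu*D with a periodic second difference D.
n, nu = 6, 0.5
m = [[0.0] * n for _ in range(n)]
for i in range(n):
    m[i][i] = 1.0 + 2.0 * nu
    m[i][(i - 1) % n] = -nu
    m[i][(i + 1) % n] = -nu

q_cols, r = qr_factor(m)           # factor once, before time stepping
b = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x = qr_solve(q_cols, r, b)         # reuse the factors for every right-hand side
residual = max(abs(sum(m[i][k] * x[k] for k in range(n)) - b[i]) for i in range(n))
print(f"max residual: {residual:.2e}")
```

Because the time step is uniform, the matrix does not change between steps, so the cost of the factorization is paid once and every subsequent solve costs only a matrix-vector product and a triangular solve.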

The computations presented were performed on a stand-alone
server containing a quad core AMD Phenom X4
9950 2.6 GHz processor with four Nvidia Tesla C1060 GPU
cards (960 total GPU cores). (Superior speedups will be
observed with the newer Nvidia Tesla M2090 cards, which are
rated at 665 Gflops at double precision, compared with the
legacy M1060 cards, which are rated at 78 Gflops at double
precision.) The ARK schemes are coded in plain C++
with (i) a homegrown linear algebra library and (ii) the
CUBLAS 4.0 library [23]. The RIDC4-FBE scheme is coded in
C++ with (i) OpenMP and the homegrown linear algebra
library, and (ii) OpenMP and the CUBLAS 4.0 library. Some
important subtleties for creating a hybrid OpenMP-CUDA
RIDC code are: (i) we can control which GPU card is
used for a linear solve by calling the cudaSetDevice()
function, and (ii) in using a "#pragma omp for" loop to spawn
individual threads for each prediction/correction loop, we
have to utilize static scheduling.

We note that to illustrate the effectiveness of parallel
time integrators in our numerical examples, we chose 1D
problems so that by taking a fine spatial resolution, the
temporal discretization error would dominate the spatial
error. Solutions to higher dimensional problems are practical
with the newer Fermi/Kepler cards, which have
more onboard memory.

A. Advection-Diffusion 

We first consider the canonical advection-diffusion prob- 
lem to show that we can achieve designed orders of accu- 
racy for our RIDC-FBE algorithm. The constant coefficient 
advection-diffusion equation, 

\[
u_t = c\,u_x + d\,u_{xx}, \quad x \in [0,1], \; t \in [0,40], \qquad
u(x,0) = 2 + \sin(2\pi x),
\]

with periodic boundary conditions, is discretized using the
method of lines. Specifically, the advection
term is discretized using first order upwind differences, and
the diffusion term is discretized using central differences.
The following system is then recovered:

\[
u_t = Au + Du, \quad u(0) = \alpha,
\]

where the matrix A approximates the advection operator, and
the matrix D approximates the diffusion operator. We choose
the obvious splitting, $f^N(t,u) = Au$ and $f^S(t,u) = Du$.
We take $c = 0.1$, $d = 10^{-3}$, $\Delta x = \frac{1}{1000}$. First, we show
in Figure 3a that ARK and RIDC4 achieve their designed
orders of accuracy. Observe that the error coefficient for
ARK4 is several orders of magnitude smaller than that of
RIDC4. In Figure 3b, we instead plot the results from the
same numerical run, this time plotting error as a function of
the wall clock time. Several observations can be made: (i) for
all the schemes, our GPU implementation is approximately
an order of magnitude faster than the CPU implementation,
(ii) RIDC4 (both the CPU and GPU implementations)
computes a fourth order solution in the same wall clock time as
the FBE solution, (iii) for a fixed wall clock time, RIDC4
(with 4 GPUs) computes a solution that is several orders of
magnitude more accurate than the solution computed using
ARK4 (with 1 GPU).
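The upwind and central difference operators A and D used in this example can be sketched as dense periodic matrices. The grid size n = 200 is an assumption chosen to keep the sketch fast (the experiments use a finer grid), and the forward difference is the upwind choice for $u_t = c\,u_x$ with $c > 0$; the function names are hypothetical.

```python
import math


def advection_matrix(n, dx, c):
    """First order upwind difference for c*u_x with c > 0 (forward difference), periodic."""
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        a[i][i] = -c / dx
        a[i][(i + 1) % n] = c / dx
    return a


def diffusion_matrix(n, dx, d):
    """Second order central difference for d*u_xx, periodic."""
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        m[i][(i - 1) % n] = d / dx ** 2
        m[i][i] = -2.0 * d / dx ** 2
        m[i][(i + 1) % n] = d / dx ** 2
    return m


def matvec(m, u):
    """Dense matrix-vector product."""
    return [sum(mij * uj for mij, uj in zip(row, u)) for row in m]


n, c, d = 200, 0.1, 1e-3          # n is an assumed, coarser-than-paper resolution
dx = 1.0 / n
x = [i * dx for i in range(n)]
u = [2.0 + math.sin(2.0 * math.pi * xi) for xi in x]   # initial condition

au = matvec(advection_matrix(n, dx, c), u)   # f^N(t, u) = A u
du = matvec(diffusion_matrix(n, dx, d), u)   # f^S(t, u) = D u

# Compare against the exact derivatives of the initial condition.
err_adv = max(abs(au[i] - c * 2.0 * math.pi * math.cos(2.0 * math.pi * x[i]))
              for i in range(n))
err_dif = max(abs(du[i] + d * 4.0 * math.pi ** 2 * math.sin(2.0 * math.pi * x[i]))
              for i in range(n))
print(f"advection error: {err_adv:.3e}, diffusion error: {err_dif:.3e}")
```

The advection error decreases linearly and the diffusion error quadratically as the grid is refined, matching the stated orders of the two difference operators.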




[Figure 3 panels: (a) convergence study, error versus number of
time steps Nt; (b) error versus wall clock time. Legend entries:
FBE (CPU), ARK3 (CPU, GPU), ARK4 (CPU, GPU), RIDC4 (4 CPUs, 4 GPUs).]

Figure 3: (a) Standard "error versus step size" convergence
study for RIDC4-FBE (with 10 restarts) and the
various ARK methods presented in Section II. All schemes
achieve their designed orders of accuracy. Observe that the
RIDC4-FBE error coefficient is much larger than that of
the ARK4 scheme. This is a small price to pay for the
parallel speedup that can be obtained, as shown in (b). Two
observations should be made: (i) for all the schemes, our
GPU implementation is approximately an order of magnitude
faster than the CPU implementation, (ii) RIDC4 (both
the CPU and GPU implementations) computes a fourth order
solution in the same wall clock time as the FBE solution.



We also show in Figure 4 the error of RIDC4 as a function 
of restarts. As expected, the error decreases as the number 
of restarts is increased. The penalty for each restart is having 
to fill the memory footprint at each restart before marching 
the cores/GPUs in a pipe. 

Table I summarizes the speedup that is obtained when 
RIDC4 is computed using one, two and four CPUs, and 
when RIDC4 is computed using one, two and four GPUs. 
We appear to obtain almost linear speedup, even with 10
restarts. This scaling is not surprising, since the bulk of the
computational cost is due to the linear solve, and data transfer
between host and GPU memory is not limited by bandwidth
for our example.

The percentage of time that the GPUs spend in CUBLAS
kernels is summarized in Table II. As the table indicates,
data transfer is a minimal component of our CUDA
code.




[Figure 4 plot: the horizontal axis is the number of restarts.]

Figure 4: The error of RIDC4 schemes at the final time 
T = 40 decreases as the number of restarts is increased (for 
a fixed number of time steps, in this case, 4000 time steps). 
Each restart requires that the memory footprint be refilled 
before the cores/GPUs can be marched in a pipe. 



# CPUs           Speedup
1                1.0
2                1.89
4                3.81

# CPUs & GPUs    Speedup
1                1.0
2                1.97
4                3.88

Table I: Speedup of RIDC for the advection-diffusion problem.



B. Viscous Burgers' Equation 

We also consider the solution to viscous Burgers' equa- 
tion, 

\[
u_t + \frac{1}{2} (u^2)_x = \epsilon\, u_{xx}, \quad (x,t) \in [0,1] \times [0,1],
\]
with initial and boundary conditions
\[
u(0,t) = u(1,t) = 0, \quad u(x,0) = \sin(2\pi x) + \frac{1}{2} \sin(\pi x).
\]

The solution develops a layer that propagates to the right, as 
shown in Figure 5. The diffusion term is again discretized 
using centered finite differences. A numerical flux is used 
to approximate the advection operator, 

\[
\frac{1}{2} \left( u^2(x_i) \right)_x \approx \frac{f_{i+1/2} - f_{i-1/2}}{2\,\Delta x},
\]



kernel               calls    % GPU time
trsv_kernel          5061     73.3%
gemv2N_kernel_ref    11181    20.47%
gemv2T_kernel_ref    5061     3.98%
axpy_kernel_ref      24200    1.15%
memcpyHtoD           8243     0.61%
memcpyDtoH           5061     0.43%

Table II: Profiling our GPU code for the advection-diffusion
problem.



Figure 5: Solution to Burgers' equation, with $\epsilon = 10^{-3}$ and
$\Delta x = \frac{1}{1000}$. Time snapshots at t = 0, 0.2, 0.4, 0.6, 0.8 and
1 are shown.



where

\[
f_{i+1/2} = \frac{1}{2} \left( (u_{i+1})^2 + (u_i)^2 \right).
\]

Hence, the following system of equations is obtained,

\[
u_t = C(u) + Du,
\]

where the operator C(u) approximates the hyperbolic term
using the numerical flux, and the matrix D approximates the
diffusion operator. We choose the splitting $f^N(t,u) = C(u)$
and $f^S(t,u) = Du$, and take $\epsilon = 10^{-3}$ and
$\Delta x = \frac{1}{1000}$. No restarts are used for this simulation.

Results similar to those of the previous advection-diffusion
example are observed in Figure 6. In plot (a),
the RIDC scheme achieves its designed order of accuracy.
In plot (b), we show that our RIDC implementations (both
the CPU and GPU versions) obtain a fourth order solution
in the same wall clock time as a first order semi-implicit
FBE solution. The RIDC implementations with multiple
CPU/GPU resources also achieve comparable errors to a
fourth order ARK scheme in approximately one tenth the
time.

Table III summarizes the speedup that is obtained when 
RIDC4 is computed using one, two and four CPUs, and 
when RIDC4 is computed using one, two and four GPUs. 



# CPUs           Speedup
1                1.0
2                1.94
4                3.94

# CPUs & GPUs    Speedup
1                1.0
2                1.98
4                3.95

Table III: Speedup of RIDC for Burgers' equation.

V. Conclusions 

In this paper, we further developed RIDC algorithms 
to generate a family of high order semi-implicit parallel 
[Figure 6 panels: (a) convergence study, error versus number of
time steps; (b) error versus wall clock time.]

Figure 6: In (a), we show that the ARK schemes and 
our RIDC4-FBE scheme achieve the designed orders of 
accuracy. The plot in (b) shows the error as a function of 
wall clock time. Two observations should be made: (i) for 
all the schemes, our GPU implementation is approximately 
an order of magnitude faster than the CPU implementation. 



integrators. The analysis related to convergence is a simple
extension of previous work, and the numerical experiments
demonstrate that the fourth order RIDC-FBE algorithm
achieves its designed order of accuracy. Additionally,
we showed that our semi-implicit RIDC algorithm harnessed
the computational potential of four GPUs by utilizing
OpenMP coupled with the CUBLAS library. This semi-implicit
RIDC algorithm can potentially be coupled with
existing legacy parallel spatial codes. Work is ongoing to
explore a hybrid MPI-OpenMP-CUDA algorithm for more
heterogeneous architectures.